Edit Distance with Duplications and Contractions Revisited
نویسندگان
چکیده
In this paper, we propose three algorithms for the problem of string edit distance with duplication and contraction operations, which improve the time complexity of previous algorithms for this problem. These include a faster algorithm for the general case of the problem, and two improvements which apply under certain assumptions on the cost function. The general algorithm is based on fast min-plus multiplication of square matrices, and obtains the running time of O ( |Σ|n log logn log2 n ) , where n is the length of the input strings and |Σ| is the alphabet size. This algorithm is further accelerated, under some assumption on the cost function, to O ( |Σ| ( n + nn ′2 log logn′ log2 n′ )) time, where n′ is the length of the run-length encoding of the input. Another improvement is based on a new fast matrix-vector min-plus multiplication under a certain discreteness assumption, and yields an O ( |Σ| n 3 log2 n ) time algorithm. Furthermore, this algorithm is online, in the sense that one of the strings may be given letter by letter. As part of this algorithm we present the currently fastest online algorithm for weighted CFG parsing for discrete weighted grammars. This result is useful on its own.
منابع مشابه
Genomic Distances under Deletions and Insertions
As more and more genomes are sequenced, evolutionary biologists are becoming increasingly interested in evolution at the level of whole genomes, in scenarios in which the genome evolves through insertions, deletions, and movements of genes along its chromosomes. In the mathematical model pioneered by Sankoff and others, a unichromosomal genome is represented by a signed permutation of a multise...
متن کاملModels and Algorithms for Comparative Genomics
The deluge of sequenced whole-genome data has motivated the study of comparative genomics, which provides global views on genome evolution, and also offers practical solutions in deciphering the functional roles of components of genomes. A fundamental computational problem in whole-genome comparison is to infer the most likely large-scale events (rearrangements and content-modifying events) of ...
متن کاملExemplar or Matching: Modeling DCJ Problems with Unequal Content Genome Data
The edit distance under the DCJ model can be computed in linear time for genomes with equal content or with Indels. But it becomes NP-Hard in the presence of duplications, a problem largely unsolved especially when Indels are considered. In this paper, we compare two mainstream methods to deal with duplications and associate them with Indels: one by deletion, namely DCJ-Indel-Exemplar distance;...
متن کاملThe Normalized String Editing Problem Revisited
Marzal and Vidal [8] recently considered the problem of computing the normalized edit distance between two strings, and reported experimental results which demonstrated the use of the measure to recognize handwritten characters. Their paper formulated the theoretical properties of the measure and developed two algorithms to compute it. In this short communication we shall demonstrate how this m...
متن کاملAdaptive Approximate Record Matching
Typographical data entry errors and incomplete documents, produce imperfect records in real world databases. These errors generate distinct records which belong to the same entity. The aim of Approximate Record Matching is to find multiple records which belong to an entity. In this paper, an algorithm for Approximate Record Matching is proposed that can be adapted automatically with input error...
متن کامل